
1. PREPARE

Research Questions

Recall from our presentation that one central question to text mining and natural language processing is:

How do we quantify what a document or collection of documents is about?

For our first lab on text mining in STEM education, we’ll explore this question by examining a corpus, or collection, of public posts on Twitter about the Common Core State Standards (CCSS) to better understand public discourse surrounding these standards, particularly as they relate to math education. Specifically, in this lab we’ll be applying some basic text mining techniques to address the following questions:

  1. What are the most frequent words or phrases used in tweets about the CCSS?

  2. What words and hashtags commonly co-occur, particularly with the word “math”?

Section Objectives

To help us better understand the packages, Twitter API tools, and data we’ll use during this lab to address these questions, in this section we’ll learn to:

  1. Load Packages for tidy text mining and using Twitter APIs

  2. Create a Twitter App to obtain authentication credentials, also known as keys and tokens

  3. Authorize RStudio to use your app for retrieving data from Twitter

Exercise RMarkdown File

Instructions for accessing practice file and completing lab…

1a. Load Packages

Prior Packages

Let’s begin by loading some familiar packages from previous Learning Labs that we’ll be using for data wrangling and exploration:

library(dplyr)
library(readr)
library(tidyr)
library(ggplot2)
library(readxl)
library(writexl)
library(DT)

📦 The rtweet Package

The rtweet package provides users a range of functions designed to extract data from Twitter’s REST and streaming APIs and has three main goals:

  1. Formulate and send requests to Twitter’s REST and stream APIs.

  2. Retrieve and iterate over returned data.

  3. Wrangle data into tidy structures.

Let’s load the rtweet package which we’ll be using later in this lab to accomplish all three of the goals listed above:

library(rtweet)

1b. Create a Twitter App

Before you can begin pulling tweets into R, you’ll first need to create a Twitter App in your developer account. This section and the section that follows are borrowed largely from the rtweet package by Michael Kearney and require that you have set up a Twitter developer account.

You are not required to set up a developer account for this institute, but if you are interested in creating one, these instructions succinctly outline the process, and you can set one up in about 10 minutes. We have provided the data we’ll be using for the Wrangle and Explore parts of the lab on our GitHub repository, so you can skip to section 2b. Tidy Text if you are not interested in or are unable to set up a Twitter developer account.

Steps for Creating your Twitter App

  1. Navigate to developer.twitter.com/en/apps, click the blue button that says, Create a New App, and then complete the form with the following fields:

    • App Name: What your app will be called

    • Application Description: How your app will be described to its users

    • Website URLs: Website associated with app–I recommend using the URL to your Twitter profile

    • Callback URLs: IMPORTANT enter exactly the following: http://127.0.0.1:1410

    • Tell us how this app will be used: Be clear and honest

  2. When you’ve completed the required form fields, click the blue Create button at the bottom

  3. Read through and indicate whether you accept the developer terms

  4. And you’re done!

1c. Authorize RStudio

In order to authorize R to use your Twitter App to retrieve data, you’ll need to create a personal Twitter token by completing the following steps:

  • Navigate to developer.twitter.com/en/apps and select your Twitter app
  • Click the tab labeled Keys and tokens to retrieve your keys.
  • Locate the Consumer API keys (the API key and the API secret key).

  • Scroll down to Access token & access token secret and click Create

  • Copy and paste the four keys (along with the name of your app) into an R Markdown file and pass them along to create_token(). Note that these keys are named secret for a reason. I recommend setting up your token in a separate R Markdown file from the one that you will eventually share.
## store api keys (these are fake example values; replace with your own keys)
app_name <- "Text Mining in Education"
api_key <- "afYS4vbIlPAj096E60c4W1fiK"
api_secret_key <- "bI91kqnqFoNCrZFbsjAWHD4gJ91LQAhdCJXCj3yscfuULtNkuu"
access_token <- "9551451262-wK2EmA942kxZYIwa5LMKZoQA4Xc2uyIiEwu2YXL"
access_token_secret <- "9vpiSGKg1fIPQtxc5d5ESiFlZQpfbknEN1f1m2xe5byw7"

## authenticate with the access token (no browser sign-in required)
token <- create_token(
  app = app_name,
  consumer_key = api_key,
  consumer_secret = api_secret_key,
  access_token = access_token,
  access_secret = access_token_secret)
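
Because these keys are secret, one option is to keep them out of your R Markdown files entirely and read them from environment variables instead. The following is only a sketch in base R: the variable names (TWITTER_API_KEY and so on) and the helper read_twitter_keys() are hypothetical and are not part of rtweet.

```r
# Hypothetical helper: read the four keys from environment variables.
# The variable names are assumptions; set them yourself, for example in ~/.Renviron:
#   TWITTER_API_KEY=your-key-here
read_twitter_keys <- function() {
  keys <- c(
    api_key             = Sys.getenv("TWITTER_API_KEY"),
    api_secret_key      = Sys.getenv("TWITTER_API_SECRET_KEY"),
    access_token        = Sys.getenv("TWITTER_ACCESS_TOKEN"),
    access_token_secret = Sys.getenv("TWITTER_ACCESS_TOKEN_SECRET")
  )

  # Sys.getenv() returns "" for unset variables, so fail early if any are missing
  missing_keys <- names(keys)[keys == ""]
  if (length(missing_keys) > 0) {
    stop("Missing keys: ", paste(missing_keys, collapse = ", "))
  }

  keys
}
```

With the variables set, keys <- read_twitter_keys() returns a named character vector, and values such as keys[["api_key"]] can be passed to create_token() in place of the hard-coded strings.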

If you are interested in an alternate authentication method, you can view the rtweet Twitter authorization vignette by running the following code:

vignette("auth")

Authorization in future R sessions

  • The create_token() function should automatically save your token as an environment variable for you. So next time you start an R session [on the same machine], rtweet should automatically find your token.
  • To make sure it works, restart your R session, run the following code, and check that the app name and key match your own.
## check to see if the token is loaded
get_token()
## <Token>
## <oauth_endpoint>
##  request:   https://api.twitter.com/oauth/request_token
##  authorize: https://api.twitter.com/oauth/authenticate
##  access:    https://api.twitter.com/oauth/access_token
## <oauth_app> LASER Labs
##   key:    I0M2APeuHjDPqSaussmcHYQDC
##   secret: <hidden>
## <credentials> oauth_token, oauth_token_secret
## ---

That’s it!


2. WRANGLE

In general, data wrangling involves some combination of cleaning, reshaping, transforming, and merging data (Wickham & Grolemund, 2017). The importance of data wrangling is difficult to overstate, as it involves the initial steps of going from raw data to a dataset that can be explored and modeled (Krumm et al., 2018).

  1. Search & Subset. In this section, we introduce new functions from the rtweet package to search for and filter tweets and users of interest.
  2. Tidy Text. We also introduce the tidytext package to both “tidy” and tokenize our tweets in order to create our data frame for analysis.
  3. Stop Words. We conclude our data wrangling by using the now familiar dplyr package to remove words that don’t add much value to our analysis.

2a. Search & Subset

This section introduces the following functions from the rtweet package for reading Twitter data into R:

  • search_tweets() Pulls up to 18,000 tweets from the last 6-9 days matching provided search terms. 
  • search_tweets2() Returns data from multiple search queries.
  • get_timelines() Returns up to 3,200 tweets of one or more specified Twitter users.

Search Tweets

Since one of our goals for this Learning Lab and the next is a very simplistic replication of the studies by Wang and Fikis (2019), let’s begin by using the search_tweets() function to try reading 5,000 tweets containing the CommonCore hashtag into R and storing them in a new data frame called ccss_tweets.

Type or copy the following code into your R Markdown file or console and run:

ccss_tweets <- search_tweets(q = "#CommonCore", n=5000)

Note that the first argument the search_tweets() function expects, q =, is the search term enclosed in quotation marks, and that n = specifies the maximum number of tweets to return.

✅ Comprehension Check

View your new ccss_tweets data frame using the glimpse() function introduced previously to help answer the following questions:

  1. How many tweets did our query using the Twitter API actually return? How many variables?
  2. Why do you think our query pulled in far fewer than the 5,000 tweets requested?
  3. Does our query also include retweets? How do you know?
  4. Does capitalization in your query matter?

Using the OR Operator

Wang and Fikis (2019) collected tweets containing the hashtags #CommonCore and #CCSS for 12 months from 2014 to 2015. Unfortunately, a basic Twitter developer account only lets us go back about a week, but retrieving tweets with the two hashtags identified by the authors is not an issue.

Let’s modify our query using the OR operator to also include #CCSS, so it will return tweets containing either #CommonCore or #CCSS, and assign the results to ccss_or_tweets:

ccss_or_tweets <- search_tweets(q = "#commoncore OR #ccss", n=5000)

✅ Comprehension Check

Try including both search terms but excluding the OR operator to answer the following questions:

  1. Does excluding the OR operator return more tweets, the same number of tweets, or fewer tweets? Why?
  2. What happens if you remove the hashtag? Does it still return tweets with CommonCore?
  3. What other useful arguments does the search_tweets() function contain? Try adding one and see what happens.

Hint: Use the ?search_tweets help function to learn more about the q argument and other arguments for composing search queries.

Use Multiple Queries

Although Wang and Fikis (2019) limited their query to the two hashtags used above, at some point you may be interested in a more complex query that includes additional search terms. Unfortunately, the OR operator only gets us so far. In order to pass multiple queries, we will need to use the c() function to combine our search terms into a single character vector.

Copy and paste the following code to store the results of our query in ccss_tweets:

ccss_tweets <- search_tweets2(c("#commoncore OR #ccss",
                                '"common core standards"',
                                '"common core state standards"'), 
                             n=5000,
                             include_rts = FALSE)

Notice the unique syntax required for the q argument. For example, when “OR” is entered between search terms, as in q = "#CommonCore OR #CCSS", Twitter’s REST API should return any tweet that contains either “#CommonCore” or “#CCSS.” It is also possible to search for exact phrases using double quotes. To do this, either wrap single quotes around a search query that uses double quotes, e.g., q = '"common core standards"' as we did above, or escape each internal double quote with a single backslash, e.g., q = "\"common core standards\"".
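
To see that the two quoting styles produce exactly the same query string, and how an OR query can be assembled from a vector of terms, here is a small sketch in base R. The object names are made up for illustration, and nothing here contacts Twitter.

```r
# Two equivalent ways to write an exact-phrase query string
q_single  <- '"common core standards"'      # single quotes around double quotes
q_escaped <- "\"common core standards\""    # escaped double quotes
identical(q_single, q_escaped)              # TRUE: both are the same string

# Build an OR query from a character vector of hashtags
hashtags <- c("#commoncore", "#ccss")
or_query <- paste(hashtags, collapse = " OR ")
or_query                                    # "#commoncore OR #ccss"
```

Either string could then be supplied as the q argument of search_tweets().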

To learn more about constructing search terms using the query argument, enter ?search_tweets in your console and review the documentation for the q= argument.

✅ Comprehension Check
  1. Use the search_tweets() function to create your own custom query for a Twitter hashtag or topic(s) of interest.

Subset Tweets

As you may have noticed, we have way more data than we need for our analysis and should probably pare it down to just what we’ll use.

First, it’s likely the authors removed retweets from their analysis, since a retweet is simply a user reposting someone else’s tweet and would duplicate the exact content of the original. It’s also likely that they limited their analysis to English-language tweets.

Let’s use the filter() function introduced in previous labs to subset rows containing only original tweets in the English language:

# is_retweet is a logical (TRUE/FALSE) column, so negate it rather than compare it to a string
ccss_tweets <- ccss_tweets %>% filter(!is_retweet, 
                                      lang == "en")
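
A quick way to check how a filter on is_retweet behaves is to try it on a toy data frame first. The sketch below assumes that is_retweet is a logical (TRUE/FALSE) column, as it is in the rtweet output this lab was written for; the toy_tweets data are invented for illustration.

```r
library(dplyr)

# Toy stand-in for rtweet output; is_retweet is assumed to be logical
toy_tweets <- tibble(
  text       = c("An original post", "RT someone else", "Un tuit original"),
  is_retweet = c(FALSE, TRUE, FALSE),
  lang       = c("en", "en", "es")
)

# Keep original (non-retweet) tweets written in English
originals_en <- toy_tweets %>%
  filter(!is_retweet, lang == "en")

# Comparing a logical column to the string "False" matches nothing,
# because FALSE is coerced to the string "FALSE" before the comparison
n_string_match <- sum(toy_tweets$is_retweet == "False")
```

Here originals_en keeps only the first row, while n_string_match is 0, which is why the condition is written against the logical value rather than a quoted string.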

Now let’s use the select() function to select the following columns from our ccss_tweets data frame:

  1. screen_name of the user who created the tweet
  2. created_at timestamp for examining changes in sentiment over time
  3. text containing the tweet which is our primary data source of interest
ccss_tweets <- select(ccss_tweets,
                 screen_name, 
                 created_at, 
                 text)
✅ Comprehension Check

For the remainder of the lab, you’ll be asked to use your own Twitter data. Complete the following steps before proceeding to the 2b. Tidy Text section:

  1. Create a new code chunk and write a query based on a STEM area of interest.

  2. Subset your data to remove any unnecessary tweets from analysis.

  3. Assign your search to a new object called my_tweets.

  4. Output your new dataset using the datatable() function from the DT package and take a quick look.

Extra credit for using the %>% pipe operator and efficient use of arguments to keep your code succinct and using the <- assignment operator only once.

Your output should look something like this:

Write to Excel File

Finally, let’s save our tweets to a file to use in later exercises, since the tweets returned by a search change from minute to minute. We’ll save them as a Microsoft Excel file since one of our columns cannot be stored in a flat file like .csv.

Let’s use the write_xlsx() function from the writexl package just like we would the write_csv() function from the readr package in Unit 1:

write_xlsx(ccss_tweets, "data/ccss_tweets.xlsx")
✅ Comprehension Check
  1. What happens if you try to write to a flat file like .csv?

Other Useful Functions (Optional)

For your own research, you may be interested in exploring posts by specific users rather than topics, key words, or hashtags. Yes, there is a function for that too!

For example, let’s create a character vector containing the usernames of the LASER Institute leads using the c() function again and use the get_timelines() function to get the most recent tweets from each of those users:

fi <- c("sbkellogg", "haspires", "tarheel93", "drcallie_tweets", "AlexDreier")

fi_tweets <- fi %>%
  get_timelines(include_rts=FALSE)

Notice that you can use the pipe operator with the rtweet functions just like you would other functions from the tidyverse.

And let’s use the sample_n() function from the dplyr package to pick 10 random tweets and use select() to view just the screen_name and text columns, which contain the user and the content of their post:

sample_n(fi_tweets, 10) %>%
  select(screen_name, text)

The rtweet package also has a handy ts_plot() function to take a very quick look at how far back our data set goes:

ts_plot(ccss_tweets, by = "days")

Notice that this effectively creates a ggplot time series plot for us. I’ve included the by = argument, which by default is set to “days.” It looks like the tweets go back about 9 days, which is as far back as Twitter’s standard search reaches.

Try changing it to “hours” and see what happens.

✅ Comprehension Check

To conclude Section 2a, try one of the following search functions from the rtweet vignette:

  1. get_timelines() Get the most recent 3,200 tweets from users.
  2. stream_tweets() Randomly sample (approximately 1%) from the live stream of all tweets.
  3. get_friends() Retrieve a list of all the accounts a user follows.
  4. get_followers() Retrieve a list of the accounts following a user.
  5. get_favorites() Get the most recently favorited statuses by a user.
  6. get_trends() Discover what’s currently trending in a city.
  7. search_users() Search for 1,000 users with the specific hashtag in their profile bios.

We’ve only scratched the surface of the number of functions available in the rtweet package for searching Twitter. To learn more about the rtweet package, you can find full documentation on CRAN at: <https://cran.r-project.org/web/packages/rtweet/rtweet.pdf>

Or use the following function to access the package vignette:

vignette("intro", package="rtweet")

2b. Tidy Text

📦 The tidytext Package

Text data, by its very nature, is ESPECIALLY untidy and is sometimes referred to as “unstructured.” The tidytext package provides functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages.

As we’ll learn firsthand later in this lab, using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use. Much of the infrastructure needed for text mining with tidy data frames already exists in tidyverse packages to which we’ve already been introduced.

For a more comprehensive introduction to the tidytext package, we cannot recommend enough the free online book, Text Mining with R: A Tidy Approach by Silge and Robinson (2018).

Let’s go ahead and load tidytext:

library(tidytext)

Attention: From this point forward, we’ll also use a shared dataset constructed with the Twitter Academic Research product track that allows for a much greater number of tweets to be accessed over a far greater period of time. This will also ensure that we’re producing similar results so we can check to see if our code is behaving as expected.

Let’s use the read_csv() function from the readr package loaded in Section 1 to read in the data stored in the data folder of our R project:

ccss_tweets <- read_csv("data/ccss-tweets.csv",
                        col_types = cols(id = col_character(), 
                                         author_id = col_character()
                                         )
                        )

Tokenize Text

In Chapter 1 of Text Mining with R, Silge and Robinson (2018) define the tidy text format as a table with one-token-per-row. A token is a meaningful unit of text, such as a word, two-word phrase (bigram), or sentence that we are interested in using for analysis. And tokenization is the process of splitting text into tokens.

This one-token-per-row structure is in contrast to the ways text is often stored for text analysis, perhaps as strings in a corpus object or in a document-term matrix. For tidy text mining, the token that is stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph.

For this part of our workflow, our goal is to transform our ccss_tweets data from this:

## # A tibble: 11,592 x 4
##    text                              id         author_id    created_at         
##    <chr>                             <chr>      <chr>        <dttm>             
##  1 "Textual Evidence in #GatheringB… 108537318… 93237206838… 2019-01-16 03:08:27
##  2 "11 PM ET Thursday live 805-285-… 108533640… 38784416     2019-01-16 00:42:18
##  3 "11 PM ET Thursday live 805-285-… 108533633… 85409136     2019-01-16 00:42:00
##  4 "Petrilli /Forham were the ones … 108532985… 253343991    2019-01-16 00:16:16
##  5 "This is rich.\nLet's look at th… 108532928… 253343991    2019-01-16 00:14:00
##  6 "This new math will always leave… 108532139… 25189615     2019-01-15 23:42:39
##  7 "Our free #lesson gives a sample… 108528177… 91497123610… 2019-01-15 21:05:13
##  8 "With a focus on the complete wr… 108528172… 281718510    2019-01-15 21:05:00
##  9 "Super excited to learn from @ju… 108526899… 395853723    2019-01-15 20:14:25
## 10 "Find ideas for how to teach &am… 108526204… 99865113846… 2019-01-15 19:46:48
## # … with 11,582 more rows

Into a “tidy text” format that looks more like this familiar tibble data structure:

## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 169,711 x 4
##    id                  author_id         created_at          word               
##    <chr>               <chr>             <dttm>              <chr>              
##  1 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 textual            
##  2 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 evidence           
##  3 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #gatheringblue     
##  4 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #ccss              
##  5 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #rl71              
##  6 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #rl72              
##  7 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #w72               
##  8 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #middleschool      
##  9 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #interactivenotebo…
## 10 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 🍎📚📝✅💯❤️🚀     
## # … with 169,701 more rows

Later we’ll learn about other data structures for text analysis like the document-term matrix and corpus objects. For now, however, working with the familiar tidy data frame allows us to take advantage of popular packages that use the shared tidyverse syntax and principles for wrangling, exploring, and modeling data.

Unigrams

The tidytext package provides the incredibly powerful unnest_tokens() function to tokenize text (including tweets!) and convert them to a one-token-per-row format.

Let’s tokenize our tweets by using this function to split each tweet into one row per word, which makes the text easier to analyze:

ccss_unigrams <- unnest_tokens(ccss_tweets, 
                               output = word, 
                               input = text)

There is A LOT to unpack with this function. First notice that unnest_tokens() expects a data frame as the first argument, followed by two column names. The second argument is an output column name that doesn’t currently exist but will be created as the text is “unnested” into it (word, in this case). This is followed by the input column that the text comes from, which we uncreatively named text. Also notice:

  • By default, a token is an individual word or unigram.

  • Other columns, such as author_id and created_at, are retained.

  • All punctuation has been removed.

  • Tokens have been changed to lowercase, which makes them easier to compare or combine with other datasets (use the to_lower = FALSE argument to turn off if desired).
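
These defaults are easy to verify on a tiny example. The sketch below uses an invented two-tweet tibble (toy_tweets is not part of the lab data) and the default word tokenizer.

```r
library(dplyr)
library(tidytext)

# Two made-up tweets with mixed case, punctuation, and a hashtag
toy_tweets <- tibble(
  id   = c("1", "2"),
  text = c("Common Core math is hard!", "I love #math.")
)

# One row per word; the id column is retained and the text column is dropped
toy_unigrams <- unnest_tokens(toy_tweets, output = word, input = text)
toy_unigrams
```

The result has eight rows, all lowercase and without punctuation. Note that the default tokenizer also strips the # from “#math”.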

The unnest_tokens() function also has a specialized “tweets” tokenizer, available through the token = argument, that is very useful for dealing with Twitter text because it retains hashtags and mentions of usernames with the @ symbol.

✅ Comprehension Check

Rewrite the code below to include the token argument set to "tweets":

ccss_unigrams <- unnest_tokens(ccss_tweets, 
                              output = word, 
                              input = text, 
                              _____ = _____)

Your output should look something like this:

## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 289,390 x 4
##    id                  author_id         created_at          word               
##    <chr>               <chr>             <dttm>              <chr>              
##  1 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 textual            
##  2 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 evidence           
##  3 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 in                 
##  4 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #gatheringblue     
##  5 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #ccss              
##  6 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #rl71              
##  7 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #rl72              
##  8 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #w72               
##  9 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #middleschool      
## 10 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #interactivenotebo…
## # … with 289,380 more rows

Bigrams

In the function above we specified tokens as individual words, but many interesting text analyses are based on the relationships between words, which words tend to follow others immediately, or words that tend to co-occur within the same documents.

We can also use the unnest_tokens() function to tokenize our tweets into consecutive sequences of words, called n-grams. By seeing how often word X is followed by word Y, we can then build a model of the relationships between them, as we’ll see in Section 4. MODEL.

We do this by adding the token = "ngrams" option to unnest_tokens(), and setting n to the number of words in each n-gram. Let’s set n to 2, so we can examine pairs of two consecutive words, often called “bigrams”:

ccss_bigrams <- ccss_tweets %>% 
  unnest_tokens(bigram, 
                text, 
                token = "ngrams", 
                n = 2)

ccss_bigrams
## # A tibble: 301,118 x 4
##    id              author_id        created_at          bigram                  
##    <chr>           <chr>            <dttm>              <chr>                   
##  1 10853731852631… 932372068380880… 2019-01-16 03:08:27 textual evidence        
##  2 10853731852631… 932372068380880… 2019-01-16 03:08:27 evidence in             
##  3 10853731852631… 932372068380880… 2019-01-16 03:08:27 in gatheringblue        
##  4 10853731852631… 932372068380880… 2019-01-16 03:08:27 gatheringblue ccss      
##  5 10853731852631… 932372068380880… 2019-01-16 03:08:27 ccss rl71               
##  6 10853731852631… 932372068380880… 2019-01-16 03:08:27 rl71 rl72               
##  7 10853731852631… 932372068380880… 2019-01-16 03:08:27 rl72 w72                
##  8 10853731852631… 932372068380880… 2019-01-16 03:08:27 w72 middleschool        
##  9 10853731852631… 932372068380880… 2019-01-16 03:08:27 middleschool interactiv…
## 10 10853731852631… 932372068380880… 2019-01-16 03:08:27 interactivenotebook 2ks…
## # … with 301,108 more rows
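
To see exactly how the bigram window slides across a text, it can help to tokenize a single short string. This is a minimal sketch with an invented one-row tibble (toy_tweet), using the same token = "ngrams" and n = 2 arguments as above.

```r
library(dplyr)
library(tidytext)

# A single made-up tweet of four words
toy_tweet <- tibble(id = "1", text = "Common Core math standards")

# Each bigram overlaps the previous one by a single word
toy_bigrams <- toy_tweet %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

toy_bigrams
```

Four words yield three overlapping bigrams: “common core”, “core math”, and “math standards”.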


Before we move any further let’s take a quick look at the most common unigrams and bigrams in our two datasets:

ccss_unigrams %>%
  count(word, sort = TRUE)
## # A tibble: 39,703 x 2
##    word            n
##    <chr>       <int>
##  1 #commoncore  9852
##  2 the          8508
##  3 to           6858
##  4 of           4821
##  5 and          4129
##  6 a            3686
##  7 is           3567
##  8 in           3224
##  9 for          3015
## 10 this         2415
## # … with 39,693 more rows
ccss_bigrams %>% 
  count(bigram, sort = TRUE)
## # A tibble: 138,876 x 2
##    bigram                     n
##    <chr>                  <int>
##  1 https t.co             10538
##  2 commoncore https         940
##  3 common core              935
##  4 commoncore math          749
##  5 of the                   735
##  6 in the                   504
##  7 the commoncore           498
##  8 commoncore commonsense   497
##  9 of writing               492
## 10 say https                491
## # … with 138,866 more rows

Well, many of these tweets are clearly about the common core, but beyond that it’s a bit hard to tell because there are so many “stop words” like “the,” “to,” “and,” “in” that don’t carry much meaning by themselves.

2c. Stop Words

Often in text analysis, we will want to remove these stop words if they are not useful for an analysis. The stop_words dataset in the tidytext package contains stop words from three lexicons. We can use them all together, as we have here, or filter() to only use one set of stop words if that is more appropriate for a certain analysis.

Let’s take a closer look at the lexicons and the stop words included in each:

datatable(stop_words)

The anti_join Function

In order to remove these stop words, we will use a function called anti_join() that looks for matching values in a specific column from two datasets and returns rows from the original dataset that have no matches.

For a good overview of the different dplyr joins see here: https://medium.com/the-codehub/beginners-guide-to-using-joins-in-r-682fc9b1f119
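
As a small illustration of how anti_join() behaves, the sketch below uses two invented tibbles: toy_words stands in for tokenized tweets and toy_stops is a hypothetical two-word stop list, not the stop_words dataset from tidytext.

```r
library(dplyr)

toy_words <- tibble(word = c("the", "common", "core", "is", "math"))
toy_stops <- tibble(word = c("the", "is"))  # hypothetical mini stop list

# Keep only the rows of toy_words whose word has no match in toy_stops
kept <- anti_join(toy_words, toy_stops, by = "word")
kept
```

Only “common”, “core”, and “math” remain, in their original order.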

Now let’s remove stop words that don’t help us learn much about what people are saying about the state standards.

tidy_unigrams <- anti_join(ccss_unigrams,
                         stop_words,
                         by = "word")

tidy_unigrams
## # A tibble: 169,711 x 4
##    id                  author_id         created_at          word               
##    <chr>               <chr>             <dttm>              <chr>              
##  1 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 textual            
##  2 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 evidence           
##  3 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #gatheringblue     
##  4 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #ccss              
##  5 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #rl71              
##  6 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #rl72              
##  7 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #w72               
##  8 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #middleschool      
##  9 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #interactivenotebo…
## 10 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 🍎📚📝✅💯❤️🚀     
## # … with 169,701 more rows

Notice that we’ve specified the by = argument to look for matching words in the word column for both data sets and remove any rows from the ccss_unigrams dataset that match the stop_words dataset. Remember that when we first tokenized our dataset, I conveniently chose output = word as the column name because it matches the column name word in the stop_words dataset contained in the tidytext package. This makes our call to anti_join() simpler because anti_join() knows to look for the column named word in each dataset. However, specifying by = wasn’t really necessary, since word is the only matching column name in both datasets and anti_join() would have matched on it by default.

Filtering Bigrams

As we saw above, a lot of the most common bigrams are pairs of common (uninteresting) words as well. Dealing with these is a little less straightforward, and we’ll need to use the separate() function from the tidyr package, which splits a column into multiple columns based on a delimiter. This lets us separate each bigram into two columns, “word1” and “word2,” at which point we can remove cases where either is a stop word.

library(tidyr)

bigrams_separated <- ccss_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

tidy_bigrams <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")
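
The same separate, filter, and unite steps can be traced on a handful of rows. This sketch uses invented bigrams and a hypothetical two-word stop list (toy_stops) in place of stop_words$word.

```r
library(dplyr)
library(tidyr)

toy_bigrams <- tibble(
  bigram = c("of the", "common core", "the commoncore", "commoncore math")
)
toy_stops <- c("of", "the")  # hypothetical mini stop list

toy_filtered <- toy_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%    # split into two columns
  filter(!word1 %in% toy_stops, !word2 %in% toy_stops) %>% # drop if either word is a stop word
  unite(bigram, word1, word2, sep = " ")                   # glue the pair back together

toy_filtered
```

Only “common core” and “commoncore math” survive, because each of the other two bigrams contains at least one stop word.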


Custom Stop Words

Before wrapping up, let’s take a quick count of the most common unigrams and bigrams to see if the results are a little more meaningful:

tidy_unigrams %>%
  count(word, sort = TRUE)
## # A tibble: 39,086 x 2
##    word            n
##    <chr>       <int>
##  1 #commoncore  9852
##  2 #ccss        1843
##  3 math         1750
##  4 amp          1520
##  5 #education   1064
##  6 students     1053
##  7 common       1011
##  8 core          987
##  9 education     850
## 10 standards     846
## # … with 39,076 more rows
tidy_bigrams %>% 
  count(bigram, sort = TRUE)
## # A tibble: 63,513 x 2
##    bigram                     n
##    <chr>                  <int>
##  1 https t.co             10538
##  2 commoncore https         940
##  3 common core              935
##  4 commoncore math          749
##  5 commoncore commonsense   497
##  6 commonsense knowledge    488
##  7 compact style            488
##  8 power site:is            488
##  9 writing common           488
## 10 t.co sxw9rhgwzm          470
## # … with 63,503 more rows

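Notice that the nonsense word “amp” is among our high-frequency words. It is left over from the HTML-encoded ampersand (&amp;) in the raw tweet text, which is visible in row 10 of the ccss_tweets preview shown earlier. Let’s add a filter to our previous code, similar to what we did with our bigrams, to remove rows with “amp” in them: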

tidy_unigrams <-
  ccss_unigrams %>%
  anti_join(stop_words, by = "word") %>%
  filter(word != "amp")

Note that we could extend this filter to weed out any additional words that don’t carry much meaning but skew our data by being so prominent.
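
One way to extend the filter is to keep your own small table of custom stop words and remove them with anti_join(). The words listed below are only illustrative choices, and toy_unigrams is an invented stand-in for tidy_unigrams.

```r
library(dplyr)

# Invented tokens standing in for tidy_unigrams
toy_unigrams <- tibble(
  word = c("math", "amp", "https", "standards", "t.co", "math")
)

# Hypothetical custom stop words: web and encoding residue rather than real words
custom_stops <- tibble(word = c("amp", "https", "t.co"))

toy_clean <- toy_unigrams %>%
  anti_join(custom_stops, by = "word")

toy_clean
```

Keeping the custom words in their own table makes it easy to add or remove entries later without rewriting the filter.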

✅ Comprehension Check

Tidy your my_tweets dataset from the ✅ Comprehension Check in Section 2a by tokenizing your text into unigrams and removing stop words.

Also, since we created some unnecessarily lengthy code to demonstrate some of the steps in the tidying process, try to use a more compact series of functions and assign your data frame to my_tidy_tweets.

3. EXPLORE

As highlighted in DSEIUR and Learning Analytics Goes to School, calculating summary statistics, data visualization, and feature engineering (the process of creating new variables from a dataset) are key parts of exploratory data analysis. In Section 3, we will calculate some very basic summary statistics from our tidied text, explore key words of interest to gather additional context, and use data visualization to identify patterns and trends that may not be obvious from our tables and numerical summaries. Topics addressed in Section 3 include:

  1. Count Words. We count the most frequent words in our tidied tweets and visualize them.

  2. Time Series. We take a quick look at the date range of our tweets and compare number of postings by standards.

3a. Count Words

People new to text mining are often disillusioned when they figure out how it’s actually done — which is still, in large part, by counting words. They’re willing to believe that computers have developed some clever strategy for finding patterns in language — but think “surely it’s something better than that?”

The quote above from Word Counts are Amazing by Ted Underwood…

Recall from the previous section that one overarching question guiding most of our efforts in these text mining labs is: “How do we quantify what a text is about?”
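
As a minimal sketch of the idea that counting words gives a rough answer to this question, the example below tokenizes a tiny, made-up set of texts and counts the words that remain after removing stop words. The toy_tweets data frame is hypothetical and is not part of the lab's Twitter data.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical mini-corpus; not part of the lab's Twitter data
toy_tweets <- tibble(
  id = 1:3,
  text = c("Common Core math is hard",
           "My kids like math",
           "Math standards changed")
)

toy_word_counts <- toy_tweets %>%
  unnest_tokens(word, text) %>%            # one lowercased word per row
  anti_join(stop_words, by = "word") %>%   # drop common stop words
  count(word, sort = TRUE)                 # most frequent words first

toy_word_counts
```

Even on three short texts, the most frequent remaining word ("math") hints at what the collection is about, which is the same logic we apply to the full set of tweets.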


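library(wordcloud2)

# Word cloud of words that appear more than 200 times
tidy_unigrams %>%
  count(word) %>%
  filter(n > 200) %>%
  wordcloud2()

# Bar chart of words that appear more than 500 times, ordered by frequency
tidy_unigrams %>%
  count(word) %>%
  filter(n > 500) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
  geom_col()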


ccss_tweets %>%
  select(text) %>% 
  filter(grepl('math', text)) %>%
  sample_n(20)
## # A tibble: 20 x 1
##    text                                                                         
##    <chr>                                                                        
##  1 "States have educational standards and the federal government creates the #C…
##  2 "Parent Survey - PRINT and DIGITAL by Martha's Resource Corner on Teachers P…
##  3 "I just heard @fivebelow raising prices on some items, they must’ve just hea…
##  4 "@sgrant3350 @brithume He learned #commoncore math. Need I say more https://…
##  5 "@Mitchell_Bay7 Forgive us, must have been using #CommonCore math 😉--it actu…
##  6 "#MathTeachers check out #STEMvideohall &amp; learn how these federally fund…
##  7 "You would think with \"two million\" followers on a social media platform, …
##  8 "#Florida @GovRonDeSantis announces new statewide ed standards to get rid of…
##  9 "Can I just say I don't like #GoogleClassroom. And also, if I'm going to hav…
## 10 "@nmontague22 @HEIRMJ You use #commoncore math to arrive at that answer?"    
## 11 "How the fuck do you do common core math for 37-19???? For the love of God!!…
## 12 "What’s the matter with you people anyway ?\n\n@TheDemocrats are using the #…
## 13 "Me when my kids tell me \"That's not the way my math teacher wants it solve…
## 14 "1️⃣ What is the scale factor of the 5 different cars if the middle one repres…
## 15 "Thanks @ddburkey for demonstratig that engineers have been doing #commoncor…
## 16 "@KimberlyMrsRR1 @AOC @ILMFOrg #OccasionalCortex was schooled on #CommonCore…
## 17 "STRUGGLE FREE DIVISION AND MULTIPLICATION OF FRACTIONS by Number Sense Guy …
## 18 "Uploaded my first Anki deck to help kids practice sight-reading numbers 1-2…
## 19 "@RealJamesWoods Maybe he's using #CommonCore mathematics😂"                 
## 20 "Tennessee Tech University #cookeville #tn #tennessee #middleschool #creativ…


unigram_counts <- tidy_unigrams %>%
  count(word) %>%
  filter(n > 200)

# Use the n column as a plain vector so ifelse() returns one color per word
wordcloud2(unigram_counts,
           color = ifelse(unigram_counts$n > 800, 'black', 'gray'))



3b. Graph Pairs

tidy_bigrams %>%
  count(bigram, sort = TRUE)
## # A tibble: 63,513 x 2
##    bigram                     n
##    <chr>                  <int>
##  1 https t.co             10538
##  2 commoncore https         940
##  3 common core              935
##  4 commoncore math          749
##  5 commoncore commonsense   497
##  6 commonsense knowledge    488
##  7 compact style            488
##  8 power site:is            488
##  9 writing common           488
## 10 t.co sxw9rhgwzm          470
## # … with 63,503 more rows


library(igraph)
## 
## Attaching package: 'igraph'
## The following object is masked from 'package:tidyr':
## 
##     crossing
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
library(ggraph)
library(stringr)

math_bigrams <- ccss_tweets %>%
  filter(str_detect(text, 'math')) %>%
  unnest_tokens(bigram, 
                text, 
                token = "ngrams", 
                n = 2)

bigrams_separated <- math_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

bigram_graph <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE) %>%
  filter(n > 10) %>%
  graph_from_data_frame()


set.seed(2017)

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n)) +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)



4. MODEL

library(widyr)

math_unigrams <- ccss_tweets %>%
  filter(str_detect(text, 'math')) %>%
  unnest_tokens(word, 
                text, 
                token = "tweets")
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
word_pairs <- math_unigrams %>%
  pairwise_count(word, id, sort = TRUE)
## Warning: `distinct_()` was deprecated in dplyr 0.7.0.
## Please use `distinct()` instead.
## See vignette('programming') for more help
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.


word_cors <- math_unigrams %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, id, sort = TRUE)

word_cors %>%
  filter(item1 == "math")
## # A tibble: 331 x 3
##    item1 item2  correlation
##    <chr> <chr>        <dbl>
##  1 math  i           0.141 
##  2 math  they        0.116 
##  3 math  core        0.109 
##  4 math  is          0.104 
##  5 math  common      0.103 
##  6 math  must        0.0933
##  7 math  thats       0.0930
##  8 math  think       0.0918
##  9 math  do          0.0914
## 10 math  me          0.0897
## # … with 321 more rows


word_cors <- tidy_unigrams %>%
  group_by(word) %>%
  filter(n() >= 50) %>%
  pairwise_cor(word, id, sort = TRUE)


word_cors %>%
  filter(item1 %in% c("math", "#math")) %>%
  group_by(item1) %>%
  slice_max(correlation, n = 6) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ item1, scales = "free") +
  coord_flip()


word_cors %>%
  filter(correlation > .15) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()
## Warning: ggrepel: 186 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps